Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBL-2791: Enable actor stack trace mechanism #1382

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

borrrden
Copy link
Member

This feature will keep track of two pieces of information:

  1. For any given execution, the path of enqueue and execution calls that led to the execution
  2. For any given actor, the linear history of enqueue and execution calls (regardless of source)

If an exception occurs, this information is dumped to the logs.

This feature will keep track of two pieces of information:

1. For any given execution, the path of enqueue and execution calls that led to the execution
2. For any given actor, the linear history of enqueue and execution calls (regardless of source)

If an exception occurs, this information is dumped to the logs.
@borrrden borrrden marked this pull request as draft February 17, 2022 00:19
@borrrden
Copy link
Member Author

Mark as draft until resolution of iOS simulator issue

It is not supported in iOS simulator, and since the underlying unit is a queue instead of a thread it makes more logical sense to make use of dispatch_queue_set_specific
@borrrden borrrden marked this pull request as ready for review February 17, 2022 01:33
@borrrden
Copy link
Member Author

@snej Ths GCD implementation of this was a bit awkward. Let me know if you have any ideas about how to improve it. It's complicated by the fact that the manifest used as the "queue manifest" needs to live long enough to be used by however many recursive calls happen in a given execution. For the threaded mailbox this just meant using a thread local static shared_ptr and copying it to each context that uses it. However, thread_local is not allowed in iOS simulator and doesn't really fit well with the queue based logic so I tried to make use of the set_specific API, which needs a pointer. So when retrieving, a pointer to the shared_ptr is retrieved and then deferenced (i.e. copied).

@snej
Copy link
Collaborator

snej commented Feb 19, 2022

Do we need this for Apple platforms? It sounds like the same info that Xcode's debugger already shows. (At least item 1.)

@borrrden
Copy link
Member Author

That's great if you are running in a debugger, but I'm thinking about this information being put into our logs so that we can have it even from the field.

@snej
Copy link
Collaborator

snej commented Feb 19, 2022

That is a lot of overhead to add to production builds! I think I'd need to be convinced that this is necessary. I can see that it would be useful in some occasions, but it would be slowing everything down and adding to memory bloat.

Is this something that can be disabled except by a runtime flag?

@borrrden
Copy link
Member Author

borrrden commented Feb 19, 2022

I'm trying to balance things out here. If it is enabled or disabled at runtime then I guarantee we get into a case where it's off and we end up back where we started -> with a bunch of intertwined logs that make it difficult to navigate through the flow of an issue of "replicator getting stuck" without context of the calls that led to that point. I disagree that it adds an excessive amount of memory pressure since it's going to prune out entries as it receives new ones (I'm certainly open to decreasing the number of entries that are saved though). I'm going to start proposing a lot of changes like this because our logs in general often leave us puzzled as to what is going on. I want a way to rectify this situation by collecting some data about the state of the program that can be accessed on demand or at exception time in this case. Whether or not having thread / queue local stuff adds too much of a performance penalty is something I could debate about.

In short we need something here to help us navigate an actor based world in which the most common form of bug is a race condition or hang. Simply logging things is not enough. What I am after is an answer to the question "in what order did things happen in order to get here?"

EDIT I also thought of adding this information to logs instead, but I figured that collecting it in memory would be overall better for performance than logging it all in realtime.

@snej
Copy link
Collaborator

snej commented Feb 19, 2022

We really need performance testing in CI so we can see whether a new feature like this affects performance...

@borrrden
Copy link
Member Author

borrrden commented Feb 19, 2022

That's something that could be arranged I think. The only problem is that there is no clear pass/fail metric to use in CI that I can think of.

@borrrden borrrden marked this pull request as draft February 21, 2022 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants